Going far with open source tools in NLP

Krum Arnaudov, Data Scientist at Financial Times, 2023

About me

  • Joined Financial Times as Data Scientist in 2021
  • Previously a DS at Amplify Analytics; before that, Account & Ops Management at HPI
  • ML knowledge largely self-taught via free online content + the AI specialisation at SoftUni.
  • Main hobbies: my kids, some music, local politics

Topic today

  • Work with an NLP dataset.
  • Discuss central practical NLP concepts.
  • Build a small app.
  • Repo

Before we start - Definition

Open-source software (OSS) is computer software that is released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose.

Before we start - Motivation

Step 1 - Data

  • Recipe Box dataset

  • Data Example:

    
      "p3pKOD6jIHEcjf20CCXohP8uqkG5dGi": {
        "instructions": "Toss ingredients lightly and spoon into a buttered baking dish. ...",
        "ingredients": [
          "1/2 cup celery, finely chopped",
          "1 small green pepper finely chopped",
          "..."
        ],
        "title": "Grammie Hamblet's Deviled Crab",
        "picture_link": null
      }

Data Preprocessing

# imports and json files parsing skipped
# see data.preprocess_data.combine_json_to_dataframe.py

# Combine the data from the three JSON files
data = {**fn_data, **epi_data, **ar_data}

# Convert the data to a dataframe
df = pd.DataFrame.from_dict(data, orient='index')

# Add a new column with the concatenated text
df['full_text'] = ('Recipe title: ' +
                   df['title'] +
                   '. Ingredients: ' +
                   df['ingredients'].apply(lambda x: '; '.join(x)) +
                   '. Instructions: ' +
                   df['instructions'])

df = (df
      # remove ads
      .pipe(remove_advertisement)
      # drop the picture_link column
      .drop(['picture_link'], axis=1)
      # estimate the number of words
      .assign(num_words=lambda d: d['full_text'].str.split().str.len())
      # drop short recipes
      .loc[lambda d: d['num_words'] > num_words_cutoff]
)


Step 2 - Document Representations

  • What is a document?
  • What is a representation?

Document Representation Goal

Find numerical representations of the documents such that semantically similar documents are close.

And semantically dissimilar documents - far apart.

What does semantically similar mean?

Document representations

Sentence Transformers

A family of language models fine-tuned specifically to produce document representations.

  • Great for semantic search, clustering, classification…
  • Multilingual models are available as well. Images too!
  • Built to work well with Hugging Face and friends.

SentenceTransformers - Triplet loss example

{'set':
  {'query': 'What can I do to get better grades?',
   'pos': ['How do I improve my grades?'],
   'neg': ['Why do I get bad grades even though I study a lot?',
           'How can I get better grades in maths?',
           'How serious is forging high school grades?']
  }
}
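
As a rough sketch of how triplets like this drive fine-tuning - the model name and training details below are illustrative, using the standard sentence-transformers training loop:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L12-v2")

# One InputExample per (anchor, positive, negative) triplet
train_examples = [
    InputExample(texts=[
        "What can I do to get better grades?",          # anchor / query
        "How do I improve my grades?",                  # positive
        "How serious is forging high school grades?",   # negative
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# Pulls the anchor and positive together, pushes the negative away
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)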

Loss functions

Sentence Transformers - Choices

Choices - MTEB

Massive Text Embedding Benchmark (MTEB)

Sentence Transformers API

from sentence_transformers import SentenceTransformer

# download model
vectoriser = SentenceTransformer("all-MiniLM-L12-v2")
# ensure that the model vectorises up to 512 tokens
vectoriser.max_seq_length = 512

docs = [rec for rec in recipe_data.full_text]
embeddings = vectoriser.encode(docs, show_progress_bar=True)

Sentence Transformers - Notes

  • Up to a maximum of 512 tokens (ca. 400 words). The rest of the text gets truncated.
    • Models ship with different max_seq_length defaults; you can increase it up to 512.
  • For longer docs - consider splitting (by paragraph, overlapping windows, etc.) - see the sketch below.
    • TF-IDF is a strong baseline.
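
A minimal sketch of the splitting idea - the paragraph split and the mean-pooling below are illustrative choices, not a fixed recipe:

import numpy as np
from sentence_transformers import SentenceTransformer

vectoriser = SentenceTransformer("all-MiniLM-L12-v2")
vectoriser.max_seq_length = 512

def embed_long_doc(text: str) -> np.ndarray:
    # Naive split by paragraph; overlapping windows are a common alternative
    chunks = [p for p in text.split("\n\n") if p.strip()]
    # Mean-pool the chunk embeddings into a single document vector
    return vectoriser.encode(chunks).mean(axis=0)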

How do you know if you’ve done a good job?

Option 1 - Visualize

  1. Reduce dimensions to 2 (UMAP seems to be the best option; TruncatedSVD for sparse data)
  2. Create an interactive plot (see the sketch below):
  • good options - altair, bokeh, plotly
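
A sketch of those two steps with umap-learn and plotly - the hover column and the UMAP parameters are my own choices:

import pandas as pd
import plotly.express as px
import umap

# 1. Reduce the embeddings to 2 dimensions
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)

# 2. Interactive scatter plot with text previews on hover
plot_df = pd.DataFrame(coords, columns=["x", "y"])
plot_df["text"] = [t[:100] for t in recipe_data.full_text]

px.scatter(plot_df, x="x", y="y", hover_data=["text"]).show()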

Embeddings visualizations with Bulk

  • Bulk - visualisation + initial data labelling in one.
  • Works with images too!
  • Steps (as code below):
    1. Reduce dimensions to 2 with UMAP.
    2. Add a text column.
    3. Save as a .csv.
    4. python -m bulk text data/bulk_st.csv
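
Steps 1-3 as code - the x/y/text column names follow the Bulk README; treat the details as an assumption:

import pandas as pd
import umap

coords = umap.UMAP(n_components=2).fit_transform(embeddings)

bulk_df = pd.DataFrame({
    "x": coords[:, 0],
    "y": coords[:, 1],
    "text": [t for t in recipe_data.full_text],
})
bulk_df.to_csv("data/bulk_st.csv", index=False)
# Then, from the shell: python -m bulk text data/bulk_st.csv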

Further know your dataset

Topic modelling with BERTopic

  • BERTopic - like Lego bricks for topic modelling
  • Embraces modularity
  • Fast when you already have the embeddings

BERTopic - simple

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Reduce stopwords in labels
vectoriser_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    vectorizer_model=vectoriser_model
)

topics, probs = topic_model.fit_transform(docs, embeddings)

BERTopic - advanced

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Reduce stopwords in labels
vectoriser_model = CountVectorizer(stop_words="english")
# Customise clustering
hdbscan_model = HDBSCAN(min_cluster_size=10,
                        min_samples=1,  # reduce outliers as much as possible
                        cluster_selection_epsilon=0.1,  # reduce the number of clusters
                        metric='euclidean',
                        prediction_data=True)

# Improve label representation
sentence_model = SentenceTransformer("all-MiniLM-L12-v2")
sentence_model.max_seq_length = 512
representation_model = MaximalMarginalRelevance(diversity=0.2)

topic_model = BERTopic(
    vectorizer_model=vectoriser_model,
    hdbscan_model=hdbscan_model,
    min_topic_size=20,
    n_gram_range=(1, 2),
    embedding_model=sentence_model,
    representation_model=representation_model
)

topics, probs = topic_model.fit_transform(docs, embeddings)

Annotate your dataset

Annotation

Any process of adding metadata tags to your text data can be called annotation.

Different examples of annotation:

  • Text classification
  • Named entities
  • Entity linking

Some work - already done

  • Use Bulk
  • Use BERTopic
  • Third option - annotation tools

Annotation Tools

  1. Paid tools:
  2. Open Source tools:

Pigeon - annotate in Jupyter

from pprint import pprint
from pigeon import annotate

annotations = annotate(
  recipe_data.full_text.sample(100),
  # 1 - Very Easy, 2 - Kinda Easy, 3 - Moderate to Hard, 4 - Hard
  options=['1', '2', '3', '4'],
  # display_fn is needed because the recipes are long and otherwise hard to read
  display_fn=lambda x: pprint(x)
)

Few-shot learning with SetFit

Zero-shot vs. Few-shot learning

SetFit - How Does It Work?

  1. Get some labelled data (10-15 examples per class are fine).
  2. Fine-tune a SentenceTransformer with the labelled data.
     • Intuition - think of it as moving embeddings closer together, based on the labelled data.
  3. Add a classification model (head) - can be any scikit-learn classifier, pytorch layers, etc.
  4. Train the classification model on top of the fine-tuned SentenceTransformer.

Few-shot learning with SetFit

import pandas as pd
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# define the model
model_id = "sentence-transformers/all-MiniLM-L12-v2"
model = SetFitModel.from_pretrained(model_id)
model.model_body[0].max_seq_length = 512

# get the dataset - ca. 15 examples for each of 4 categories
# (1 - easy to 4 - hard)
annotated_df = pd.read_parquet("https://raw.githubusercontent.com/krumeto/oss_nlp_tools_demos/main/data/recipe_classes.parquet")
train_dataset = Dataset.from_pandas(annotated_df)

# Train the model
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    batch_size=5,  # reduce the batch size due to memory issues
    column_mapping={"recipe": "text", "label": "label"},
)
trainer.train()

complicated_recipe = """Some complicated recipe"""

trainer.model.predict([complicated_recipe])

Deploy

Streamlit - why do I love it?

  1. Do your usual flow

  2. Sprinkle in some Streamlit calls

    • Love the docs & blog
  3. Get your app.

  4. Stakeholders and teammates love it!

Streamlit notes

Utilize caching - wrap model loading and heavy computations in functions decorated with @st.cache_resource and @st.cache_data:

import streamlit as st
from sentence_transformers import SentenceTransformer

@st.cache_resource
def load_model():
    model = SentenceTransformer("all-MiniLM-L12-v2")
    model.max_seq_length = 512
    return model
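
And a sketch of @st.cache_data for the encoding step - the function split here is my own, not from the talk:

@st.cache_data
def embed_texts(texts: list):
    # Re-encodes only when the input texts change;
    # the model itself is cached via @st.cache_resource above
    model = load_model()
    return model.encode(texts)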

Scale

Indexes

Indexes have two main components:

  • Storing docs + docs embeddings + metadata
  • Scaled search implementations:
    • KNN (also on GPU)
    • ANN (different algos like HNSW (Hierarchical Navigable Small World)) - see the FAISS sketch below
  • See e.g. this for index choices
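
As an illustration of the ANN side, a minimal HNSW index with FAISS - the parameters are illustrative, not a recommendation:

import faiss
import numpy as np

vecs = np.asarray(embeddings, dtype="float32")

# HNSW graph index; 32 neighbours per node is a common starting point
index = faiss.IndexHNSWFlat(vecs.shape[1], 32)
index.add(vecs)

# Approximate search: 3 nearest neighbours of the first document
distances, ids = index.search(vecs[:1], 3)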

Vector Stores

  • Indexing implementations:
    • ElasticSearch
    • FAISS (Facebook, but open source)
    • OpenSearch
    • Chroma
    • SimSity (small but cute)
  • Some paid ones with free tiers:
    • Pinecone
    • Qdrant

Example 1 - SimSity

from sentence_transformers import SentenceTransformer
from simsity import create_index, load_index

# simsity requires the encoder class to have a `transform` method, hence this simple wrapper
class Encoder:
    def __init__(self, model_name, max_seq_length=512):
        self.model = SentenceTransformer(model_name)
        # Increase max_seq_length to the maximum of 512 to handle the long recipes
        self.model.max_seq_length = max_seq_length

    def transform(self, data: list):
        return self.model.encode(data)

encoder = Encoder(model_name="all-MiniLM-L12-v2")

# Populate the ANN vector index and use it. 
index = create_index([rec for rec in recipe_data.full_text], 
                     encoder,
                     path="../embeddings/"
                     )

test_recipe = """
Some recipe
"""
index.query(test_recipe, n=3)

Example 2 - ChromaDB + LangChain

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
db = Chroma.from_texts(texts=[doc for doc in recipe_data.full_text],
                       embedding=embeddings,
                       persist_directory="../embeddings/chromadb"
                       )
test_recipe = """
Some recipe
"""
results = db.similarity_search_with_score(test_recipe)

for doc, score in results[:3]:
    print("#" * 50)
    print("Distance: ", score)
    # First 100 characters of the doc
    print(doc.page_content[:100])

A note on LangChain

Summary

Value in NLP can be created with Open Source Tools.

Good embeddings open up opportunities.

Search is much more accessible than chat. It might also be more powerful.

Q&A

Additional Slides (could be messy, but kept for the curious)

Old School - Term frequency

tf(t, d) = (number of times term t occurs in document d) / (total number of terms in d)

Old School - Inverse Document Frequency

idf(t) = log(N / number of documents containing t), where N is the total number of documents in the corpus

Old School - TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)
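
A quick worked example with illustrative numbers: a term occurring 3 times in a 100-token document has tf = 3/100 = 0.03. If it appears in 10 of 1,000 documents, idf = log(1000/10) = log(100) ≈ 4.6 (natural log), so tf-idf ≈ 0.03 × 4.6 ≈ 0.14.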

Old School - TF-IDF

Using sklearn’s TfidfVectorizer:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectoriser = TfidfVectorizer(
    stop_words='english',  # the default keeps stopwords; removing them shrinks the vocabulary significantly
    min_df=2,  # ignore terms with document frequency strictly lower than the threshold (a float means a proportion of docs)
    max_df=0.95,  # ignore terms with document frequency strictly higher than the threshold (corpus-specific stop words)
    ngram_range=(1, 2),  # uni- and bi-grams
    max_features=30_000,  # unigrams are ca. 22K, so this keeps roughly the top 8K bigrams
    dtype=np.float32  # halves the size of the resulting array without much quality sacrifice (default is float64)
)

embeddings = vectoriser.fit_transform(recipe_data.full_text)
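
The sparse TF-IDF matrix plugs straight into cosine-distance nearest-neighbour search, which is part of what makes it such a strong baseline - a sketch:

from sklearn.neighbors import NearestNeighbors

# Brute-force cosine search works directly on the sparse matrix
nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(embeddings)

query_vec = vectoriser.transform(["Some recipe text"])
distances, ids = nn.kneighbors(query_vec)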

Common Sparse Representation Issues

  • Frequently mentioned:

    • Lexical gap: UK, United Kingdom, England
    • Word order not preserved
  • Actual issues:

    • Requires a “vocabulary” matrix of size (n_training_docs, n_tokens_retained) in memory, e.g. (50,000 × 100,000)
    • Huge vectors - tough to cluster, store, and scale.

Dense Document Representations

  • f(Text) -> ℝⁿ
  • an n-dimensional vector representation
  • Find a function f such that semantically similar texts end up close (see the check below)
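
A minimal check of what “close” means in practice, using the cos_sim helper from sentence-transformers - the example sentences are made up:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L12-v2")
a, b = model.encode(["chocolate cake recipe", "how to bake a chocolate gateau"])
c = model.encode("annual report on monetary policy")

print(util.cos_sim(a, b))  # high - semantically similar
print(util.cos_sim(a, c))  # low - semantically dissimilar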

Why do I strongly dislike closed source

  • Not in the spirit of ML.
  • Not in the spirit of software.
  • Danger of knowledge loss in the long term.

Also

What is an open source language model?

  • Definition
  • Open Source is a scale:
    • Sentence Transformers are open source (Apache 2.0 License)
    • Huggingface’s BLOOM is NOT open source
    • And OpenAI’s GPT-3.5/4 is as closed source as it gets
      • WTF is that - https://openai.com/policies

Open Source LLM Landscape